Abstract. Computing accurate WCET on modern complex architectures is a challenging task. A lot 
of attention has been devoted to this problem in the last decade but there are still some open issues. 
First, the control flow graph (CFG) of a binary program is needed to compute the WCET and this 
CFG is built using internal knowledge of the compiler that generated the binary code; moreover, once 
constructed, the CFG has to be manually annotated with loop bounds. Second, the algorithms to compute 
the WCET (combining Abstract Interpretation and Integer Linear Programming) are tailored for 
specific architectures: changing the architecture (e.g., replacing an ARM7 by an ARM9) requires the 
design of a new ad hoc algorithm. Third, the tightness of the computed results (obtained using the 
available tools) is seldom compared to actual execution times measured on the real hardware. 
In this paper we address these problems. We first describe a fully automatic method to compute a CFG 
based solely on the binary program to analyse. Second, we describe the model of the hardware as a 
product of timed automata, and this model is independent from the program description. The model of 
a program running on a hardware is obtained by synchronizing (the automaton of) the program with the 
(timed automata) model of the hardware. Computing the WCET is reduced to a reachability problem 
on the synchronised model and solved using the model-checker Uppaal. Finally, we present a rigorous 
methodology that enables us to compare our computed results to actual execution times measured on a 
real platform, the ARM920T. 

1 Introduction 

Embedded real-time systems are composed of a set of tasks (software) that run on a given architecture 
(hardware). These systems are subject to strict timing constraints that must be enforced by a scheduler. 
Determining if a given scheduler can schedule the system is possible only if some bounds are known about 
the execution times of each task. Performance-wise, determining tight bounds is crucial as using rough 
over-estimates might either result in a set of tasks being wrongly declared non-schedulable, or lead to the 
choice of overpowered and expensive hardware where a lot of computation time is lost. 
The WCET Problem. Given a program P, some input data d and the hardware H, the execution time of 
P on input d on H is measured as the number of cycles of the fastest component of the hardware, i.e., the 
processor. The program is given in binary code or equivalently in the assembly language of the target 
processor.¹ The worst-case execution time of program P on hardware H, WCET(P, H), is the supremum, 
over all input data d, of the execution times of P on input d on H. The WCET problem asks the following: 
Given P and H, compute WCET(P, H). 

In general, the WCET problem is undecidable because otherwise we could solve the halting problem. 
However, for programs that always terminate and have a bounded number of paths, it is computable. Indeed 
the possible runs of the program can be represented by a finite tree. Notice that this does not mean that the 
problem is tractable though. 

If the input data are known or the program execution time is independent from the input data, the tree 
contains a single path and it is usually feasible to compute the WCET. Likewise, if we can determine some 
input data that produces the WCET (which can be as difficult as computing the WCET itself), we can 
compute the WCET on a single-path program. 

* Author supported by a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme. 

1 When we refer to the "source" code, we assume the program P was generated by a compiler, and refer to the high-level program (e.g., in C) that was compiled into P. 

It is not often the case that the input data are known or that we can determine an input that produces the 
WCET. Rather the (values of the) input data are unknown, and the number of paths to be explored might be 
extremely large: for instance, for a Bubble Sort program with 100 data to be sorted, the tree representing 
all the runs of the (assembly) program on all the possible input data has more than $2^{50}$ nodes. Although 
symbolic methods (e.g., using BDDs) can be applied to analyse some programs with a huge number of 
states, they will fail to compute the exact WCET on Bubble Sort by exploring all the possible paths. 

Another difficulty of the WCET problem stems from the increasingly complex architectures embedded 
real-time systems are running on. They feature multi-stage pipelines and fast memory components like 
caches that both influence the WCET in a complicated manner. It is then a challenging problem to determine 
a precise WCET even for relatively small programs running on complex architectures. 
Methods and Tools for the WCET Problem. The reader is referred to [33] for an exhaustive presentation 
of WCET computation techniques and tools. There are two main classes of methods for computing WCET: 

- Testing-based methods. These methods are based on experiments i.e., running the program on some 
data, using a simulator of the hardware or the real platform. The execution time of an experiment is 
measured and, on a large set of experiments, maximal and minimal bounds can be obtained. A maximal 
bound computed in this way is unsafe as not all the possible paths have been explored. These methods 
might not be suitable for safety critical embedded systems but they are versatile and rather easy to 
implement. 

RapiTime [28] (based on pWCET [9]) and Mtime [29] are measurement tools that implement this 
technique. 

- Verification-based methods. These methods often rely on the computation of an abstract graph, the 
control flow graph (CFG), and an abstract model of the hardware. Together with a static analysis tool 
they can be combined to compute WCET. The CFG should produce a superset of the set of all feasible 
paths. Thus the largest execution time on the abstract program is an upper bound of the WCET. Such 
methods produce safe WCET, but are difficult to implement. Moreover, the abstract program can be 
extremely large and beyond the scope of any analysis. In this case, a solution is to take an even more 
abstract program which results in drifting further away from the exact WCET. 

Although difficult to implement, there are quite a lot of tools implementing this scheme: Bound-T [30], 
OTAWA [7], TuBound [27], Chronos [24], SWEET [?] and aiT [?] are static analysis-based tools 
for computing WCET. 

The verification-based tools mentioned above rely on the construction of a control flow graph, and 
the determination of loop bounds. This can be achieved using user annotations (in the source code) or 
sometimes inferred automatically. The CFG is also annotated with some timing information about the 
cache misses/hits and pipeline stalls, and paths analysis is carried out on this model e.g., by Integer Linear 
Programming (ILP). The algorithms implemented in the tools use both the program and the hardware 
specification to compute the CFG fed to the ILP solver. The architecture of the tools themselves is thus 
monolithic: it is not easy to adapt an algorithm for a new processor. This is witnessed by the WCET'08 
Challenge Report [20] that highlights the difficulties encountered by the participants in adapting their tools 
to the new hardware in a reasonable amount of time. Moreover, the results of the computation are not 
compared to actual execution times measured on a real platform. Notice that aiT reports comparisons 
with ARMulator (the ARM simulator of the RealView Development Suite) but this simulator is not cycle 
accurate as emphasised in ARM documentation for ARMulator, Application Notes 93 [?]: 

ARMulator consists of C based models of ARM cores and as such cannot be guaranteed to completely reproduce 
the behaviour of the real hardware. If 100% accuracy is required, an HDL model should be used. 

Outline of the Paper. Section 2 presents our contribution and related work. In Section 3 we give the 
specification of the hardware we use in the experiments. Section 4 gives some formal definitions for program 
execution on a given hardware. Section 5 presents a modular way to compute the WCET of a given 
program. Section 7 presents the technique we use to automatically build the CFG. Section 8 gives the UPPAAL 
timed automata models of the hardware. In Section 9 we report on the implementation and tool chain we 
have developed. Section 10 describes the methodology we use to compare our computed WCET with actual 
WCET and contains a summary of the results. Section 11 concludes with our ongoing and future work. 



2 Related Work 

WCET and Model-Checking. Only a few tools use model-checking techniques to compute WCET. 
Considering that (i) modern architectures are composed of concurrent components (the units of the different 
stages of the pipeline, the caches) and (ii) the synchronization of these components depends on timing 
constraints (time to execute in one stage of the pipeline, time to fetch data from the cache), formal models like 
timed automata [5] and state-of-the-art real-time model-checkers like UPPAAL [22,8] appear well-suited 
to address the WCET problem. 

In [26], A. Metzner showed that model-checkers could well be used to compute safe WCET on the 
CFG for programs running on pipelined processors with an instruction cache. More recently, Lv et al. [25] 
combined abstract interpretation (AI) techniques with real-time model-checking (and UPPAAL) to compute 
WCET on multicore platforms. 

In [21], B. Huber and M. Schoeberl consider Java programs and compare ILP-based techniques with 
model-checking techniques using the model-checker UPPAAL. Model-checking techniques seem slower 
but easily amenable to changes (in the hardware model). The recommendation is to use ILP tools for large 
programs and model-checking tools for code fragments. 

The use of timed automata (TA) and the model-checker UPPAAL for computing WCET on pipelined 
processors with caches was reported in [15,14] where the METAMOC method is described. METAMOC 
consists in: 1) computing the CFG of a program, 2) composing this CFG with a (network of timed automata) 
model of the processor and the caches. Computing the WCET is then reduced to computing the longest path 
(timewise) in the network of TA. 

The previous framework is very elegant yet has some shortcomings: (1) METAMOC relies on a value 
analysis phase that may not terminate, (2) some programs cannot be analysed (if they contain register- 
indirect jumps), (3) some manual annotations are still required on the binary program, e.g., loop bounds 
and (4) the unrolling of loops is not safe for some cache replacement policies (FIFO). 

In a previous work [10] we have already reported some similar results on the computation of WCET 
using TA. In [10], what is similar to METAMOC is the use of a network of timed automata to model the 
caches² and pipeline stages. However, in this preliminary work we had chosen to: (1) build the CFG without 
any need for annotations and (2) use a new and very compact encoding of the program and pipeline stages' 
states. In contrast, METAMOC uses a value analysis phase and requires loop bound annotations to obtain 
an (unfolded) graph of the program. 

Our Contribution. Compared to our previous work [10], this paper contains three new original 
contributions: (1) an automatic method to compute a CFG and a reduced abstract program equivalent WCET-wise 
to the original program; (2) detailed formal models of the hardware; and (3) a rigorous methodology that makes 
it possible to compare computed WCET to actual WCET measured on real hardware. 

3 Architecture of the ARM920T 

The development board we model and use in the experiment section is an Armadeus APF9328 board [?] 
which bears a 200MHz Freescale MC9328MXL micro-controller with an ARM920T processor. The 
processor embeds an ARM9TDMI core that implements the ARM v4T architecture. An overview of the 
ARM920T architecture is given in Fig. 1. The components we model in Section 8 are highlighted in orange. 

3.1 Reduced Instruction Set Computer Architecture 

The ARM architecture is a Reduced Instruction Set Computer (RISC) architecture. The instruction set 
consists of fixed size instructions and a few simple addressing modes. There are 16 general purpose registers $r_0$ 
to $r_{15}$, specialized memory transfer instructions (load/store), and data-processing instructions that operate 
on registers only. Other interesting features are multiple load/store instructions and conditional execution 
of instructions (to improve data and execution throughput). 

Three of the general purpose registers are used in a specialized way. Register $r_{13}$ is the stack pointer 
(we use sp in the sequel to refer to this register). Register $r_{14}$ is the link register (lr in the sequel) and hosts 
the return address of function calls. Register $r_{15}$ is the program counter (pc in the sequel). 

An instruction is defined by a mnemonic³ (e.g., mov) and the operands. In the sequel, we let 
$\mathcal{R} = \{r_0, \dots, r_{12}, sp, lr, pc\}$ be the set of registers of the architecture and $\mathcal{I}$ be the (finite) set of RISC 
instructions. 

2 Note that a similar model is reportedly due to A. P. Ravn in [?]. 

Fig. 1. Simplified block diagram of the ARM920T. Gray arrows are address buses/connections. White 
arrows are data buses/connections. Both are 32 bits wide. Coprocessor 15 hosts control registers for the 
caches and the MMUs. Actually Register R13 (which should not be confused with the ARM9TDMI r13 
register presented in Section 3.1) is not duplicated and is located in Coprocessor 15. It hosts a process ID 
used for virtual address to physical address translation. Some blocks like the Write back Physical TAG 
RAM and various debug and/or coprocessor interfaces are not shown. 



3.2 Execution Pipeline 

The ARM920T uses a 5-stage execution pipeline, the purpose of which is to execute concurrently the 
different tasks (Fetch, Decode, Execute, Memory, Writeback) needed to perform an instruction. The 
(normal) flow of instructions in the pipeline is shown in Fig. 2. This optimal flow may be slowed down when 
pipeline stalls occur. Most of the time, two independent consecutive instructions do not incur a stall and 
the throughput is 1 instruction/cycle. However, in certain cases stalls can occur. Assume instruction 
ldr r1, [sp, #0] is followed by add r0, r1, #1: the first instruction loads register r1 (with the content of a 
memory cell) and the second uses r1 to compute r0. This sequence of instructions brings about a load delay 
depicted in Fig. 3. One stall cycle is inserted before processing instruction add r0, r1, #1 because the 
load instruction produces the operand (r1), needed at the E stage of the add instruction, only at the end of 
its M stage. 

Fig. 2. Pipeline of the ARM920T: Instruction is fetched in F. Instruction decode and operand register 
accesses are done in D. Execution is done in E. Load/store instructions do their memory accesses in M. 
Results are written back to registers in W. 

Fig. 3. Load delay pipeline hazard. 

Sometimes the target address of a branch instruction is produced at the end of the E stage (e.g., 
conditional branching that needs the result of a comparison operation). The ARM920T does not implement any 
branch prediction mechanism. As a consequence, fetching the next instruction can only be done after the 
branch instruction has completed the E stage: this causes a branch delay, depicted in Fig. 4, that results in 2 
stall cycles before the fetch of the branch target instruction can be performed. 
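The two hazards described above can be illustrated by a small stall counter. This is only a sketch under simplifying assumptions: the `Instr` record, its fields and the `stall_cycles` helper are hypothetical names, every instruction flagged `is_branch` is assumed to pay the full 2-cycle branch delay, and no other source of stalls (cache misses, multi-cycle instructions) is modelled.

```python
from dataclasses import dataclass

LOAD_DELAY = 1    # stall cycles when an instruction uses the result of the preceding load
BRANCH_DELAY = 2  # stall cycles before the instruction after a branch can be fetched


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record holding only what the hazard check needs."""
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    is_load: bool = False
    is_branch: bool = False


def stall_cycles(seq):
    """Count the stall cycles caused by load delays and branch delays in seq."""
    stalls = 0
    for prev, cur in zip(seq, seq[1:]):
        if prev.is_load and prev.writes & cur.reads:
            stalls += LOAD_DELAY    # operand is not ready at the E stage of cur
        if prev.is_branch:
            stalls += BRANCH_DELAY  # fetch of cur waits for the E stage of prev
    return stalls


ldr = Instr("ldr r1, [sp, #0]", reads=frozenset({"sp"}),
            writes=frozenset({"r1"}), is_load=True)
add = Instr("add r0, r1, #1", reads=frozenset({"r1"}), writes=frozenset({"r0"}))
mov = Instr("mov r2, #2", writes=frozenset({"r2"}))
bne = Instr("bne 24", is_branch=True)

print(stall_cycles([ldr, add]))       # load followed by a use of r1
print(stall_cycles([ldr, mov]))       # independent instructions
print(stall_cycles([ldr, add, bne, mov]))
```

The counter looks only at adjacent pairs, which is enough for the two hazards named in the text; a faithful timing model would track each pipeline stage, as the timed automata of Section 8 do.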



3 And the condition and flags (like the "s" flag). 

Fig. 4. Branch delay pipeline hazard. 



3.3 Main Memory, Instruction and Data Cache, Write Buffer 

Both instruction and data caches have the same architecture. They are 16KB, 8-way set associative caches. 
There are 64 sets and 512 lines of 32 bytes each. The replacement policy may be set to pseudo-random or 
round-robin (FIFO). Both caches implement allocate-on-read-miss, i.e., a data item is inserted in the cache if it 
is missing when a read is performed. 
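As a quick consistency check on these figures, 64 sets of 8 lines of 32 bytes give 512 lines and 16 KB. The sketch below also splits a 32-bit address into tag, set index and byte offset. The bit layout (offset in the low bits, set index just above) is the conventional one and is an assumption here: the text does not state which address bits the ARM920T actually uses, and `split_address` is a hypothetical helper.

```python
LINE_BYTES = 32   # bytes per cache line (from the text)
NUM_SETS = 64     # number of sets (from the text)
WAYS = 8          # lines per set (from the text)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2(32)
INDEX_BITS = NUM_SETS.bit_length() - 1      # log2(64)

TOTAL_LINES = NUM_SETS * WAYS
TOTAL_BYTES = TOTAL_LINES * LINE_BYTES


def split_address(addr):
    """Split an address into (tag, set index, byte offset), assuming the usual layout."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(TOTAL_LINES, TOTAL_BYTES)
print(split_address(0x1234))
```

Two addresses compete for the same set exactly when their set indices agree, which is what a cache model needs in order to decide hits and misses.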

The data cache may be configured in write-through (when data in the cache is modified, it is immediately 
written to the main memory) or write-back (modified cached data are only written to main memory 
when needed) but does not implement allocate-on-write-miss: if non-cached data is written to, it is not 
cached but instead written to main memory directly. So, even if configured in write-back, a write miss acts 
as a write-through. Each data cache line has 2 dirty bits (indicating that a cached item has been modified 
since last cached), one per half-line, to indicate the half-line must be written back when it is replaced. 

A 16-word write buffer helps to reduce stalls when a write to the main memory occurs because of a 
write miss or, if the cache is configured in write-back, when a dirty line has to be replaced. The write buffer 
is organized in 4 half-line entries to allow cache write-back on a half-line basis. 

Finally, transfers between the caches and main memory are serialized and the bus abstracted away. 



4 Program Semantics 

In this section we present the formal semantics for the execution of binary programs. We make the following 
assumptions on the binary programs we analyse: 

(A1) the termination of a program does not depend on input data, i.e., a program terminates for all input 
data; 

(A2) reference to stack values is via the specialised register sp only;⁴ 

(A3) references to memory cells are independent from input data; this ensures that when an instruction 
computes the address of a memory cell, it is always defined; and 

(A4) the programs do not contain recursive calls. 



4.1 Notations 

We let $\mathbb{B} = \{\text{TRUE}, \text{FALSE}\}$. We have already introduced some notations: $\mathcal{R}$ is the set of registers of 
the hardware, $\mathcal{I}$ is the (finite) set of instructions the hardware can perform, and $\mathcal{M}$ the (finite) set of 
main memory cells the program can access. In the sequel we will introduce a set of predicates $\mathcal{V}$ and for 
$x \in \mathcal{R} \cup \mathcal{V} \cup \mathcal{M}$ (set of registers or predicates or memory cells), $[\![x]\!]$ denotes the content of $x$. A program 
state $s$ is a valuation of the variables in $\mathcal{R} \cup \mathcal{V} \cup \mathcal{M}$, i.e., a mapping from $\mathcal{R} \cup \mathcal{V} \cup \mathcal{M}$ to $\mathbb{D}$ where $\mathbb{D}$ is a 
finite set, e.g., 32-bit integers. We let $S$ be the set of program states. 

As program instructions are located in main memory, we define the set of labelled instructions 
$\mathcal{LI} = \mathcal{M} \times \mathcal{I}$ to be the set of pairs $(\ell : i)$ indicating that instruction $i$ is stored at address $\ell$ in main memory. 
Consequently, a program $P$ is simply a subset (necessarily finite) of $\mathcal{LI}$. We use the notation $\langle \iota \rangle : S \to S$ 
to denote the semantics of instruction $\iota \in \mathcal{LI}$. 

4 Note that these assumptions are not compulsory but they are made in the current implementation of our tool in 
the Compute CFG component (see Section 9). Moreover, they are satisfied by programs obtained using a compiler 
conforming to the ARM ABI [3]. However, the technique described here can be extended to encompass a more 
general framework. 

Remark 1. The semantics is defined on labelled instructions which means that the semantics itself may 
depend on the address the instruction is stored at. This is actually the case for some instructions like 
12: ldr r0, [pc, #4], the semantics of which is "load register r0 with the content of the memory cell located 
at offset 4 from the current value of pc", i.e., relative to address 12.⁵ 



4.2 Example Program 

As a running example we take the binary program FIBO of Listing 1.1. It has been compiled (gcc) and 
disassembled (objdump) using the GNU ARM tools from Codesourcery [12]. It computes Fib(30), the 
Fibonacci number $u_{30}$ with $u_0 = 1$, $u_1 = 1$ and $u_n = u_{n-1} + u_{n-2}$, $n \geq 2$. A program is stored in 
memory and the memory address of each instruction is the leftmost decimal number.⁶ Each program has a 
designated initial instruction $\iota_0 = (\ell_0 : i_0)$ and $[\![pc]\!] = \ell_0$ at the beginning of the execution of the program. 

To give the semantics of programs, we assume there is a set of variables $\mathcal{V} = \{le, eq, \dots\}$ to hold the 
truth values of the predicates used in the conditional instructions of the program.⁷ 

The semantics of program FIBO is given in terms of assignments to registers (on the right-hand side 
of each instruction in Listing 1.1). Each instruction assigns a new value to register pc; except for branching 
instructions, the assignment is $[\![pc]\!] := [\![pc]\!] + 4$ and we omit it in this case. A comparison operator 
(e.g., line 24) sets the truth value of the predicates that are used later in the program (e.g., eq for the instruction 
at line 24). The main loop of the computation is between lines 24 and 52; notice that this optimized 
program (compiled with option -O2) computes $u_{n+2}$ in each round of this loop ($r_2$ holds the value of $n$ and is 
incremented twice in the body of the loop). 






<main>:  /* starts at address 0; ⟦lr⟧ is the return address */ 

   0   mov    r1, #30        ⟦r1⟧ = 30 
   4   mov    r2, #2         ⟦r2⟧ = 2 
   8   add    r1, r1, #1     ⟦r1⟧ = ⟦r1⟧ + 1 
  12   add    r2, r2, #1     ⟦r2⟧ = ⟦r2⟧ + 1 
  16   mov    r0, #0         ⟦r0⟧ = 0 
  20   mov    r3, #1         ⟦r3⟧ = 1 
  24   cmp    r2, r1         ⟦eq⟧ = (⟦r2⟧ = ⟦r1⟧) 
  28   add    r0, r0, r3     ⟦r0⟧ = ⟦r0⟧ + ⟦r3⟧   /* r0 = u_n */ 
  32   bxeq   lr             if (⟦eq⟧) ⟦pc⟧ := ⟦lr⟧ else ⟦pc⟧ := 36 
  36   add    r2, r2, #1 
  40   add    r2, r2, #1 
  44   add    r3, r3, r0     ⟦r3⟧ = ⟦r3⟧ + ⟦r0⟧ 
  48   cmp    r2, r1         ⟦eq⟧ = (⟦r2⟧ = ⟦r1⟧) 
  52   add    r0, r0, r3 
  56   bne    24             if (¬⟦eq⟧) ⟦pc⟧ := 24 else ⟦pc⟧ := 60 
  60   bx     lr             ⟦pc⟧ = ⟦lr⟧ 

Listing 1.1. Program FIBO: computes Fibonacci 30 



4.3 Abstract Hardware Model 

The real hardware (Section 3) consists of the pipelined processor, instruction and data caches, write buffer 
and the main memory of the computer. We abstract away the details of the communication medium (AMBA 
bus, MMU⁸). We choose to treat the content of the main memory as a component of the program state and 
thus it is not part of the state of the hardware. The same remark applies to the registers and we consider they 
are part of the program state. A state of the hardware is then defined by the states of the different stages of 
the pipeline and the states of the caches. 

5 In pipelined architecture, the actual memory address is translated due to pipelining. For example in the ARM9, the 
address is 8+ the offset that appears in the instruction. 

6 Instruction addresses are multiples of 4 in the ARM 32-bit instruction set. 

7 In the ARM 32-bit instruction set, the truth values of these predicates are stored in the status bits N, Z, C, V. 

8 The MMU is considered to be programmed to make a translation from a virtual address page v to the physical 
address page p such that p = v. 

As we are only interested in computing execution times, we can consider that the hardware is an abstract 
machine $H$ that reads sequences of triples $(\iota, A, d) \in \mathcal{LI} \times 2^{\mathcal{M}} \times \mathbb{B}$ and outputs the time it takes to process 
such a sequence. A triple $(\iota, A, d)$ consists of a (labelled) instruction $\iota = (\ell : i)$ with $\ell \in \mathcal{M}$ and $i \in \mathcal{I}$ 
that references the set $A$ of (main memory) addresses and is performed if $d = \text{TRUE}$; if $d = \text{FALSE}$ the 
instruction is a conditional instruction and the condition the instruction depends on last evaluated to FALSE. 
Such triples (and sequences thereof) contain enough information to compute the execution time: 



- pipeline stalls (see Section 3.2) can be inferred from the first component $\iota$ of the triple, which contains the full 
text of the instruction and thus the read/written registers, and from the value of $d$ (whether the instruction is 
executed or not); 

- cache hits/misses (see Section 3.3) are completely determined by the set $A$. 

Examples of labelled instructions for program FIBO (Listing 1.1) are 0: mov r1, #30 and 32: bxeq lr. 
Notice that there is no need for actual register values in H, nor for performing the real computation, as the 
timing of instructions in the pipeline and the cache is fully determined by the instruction (and its location in 
memory), whether it is performed or not (there are conditional instructions), the registers read from/written 
to⁹ and the memory addresses used in the instruction. The fact that there is no branch prediction in the 
pipeline of the ARM920T makes things simpler but the framework we present extends to 
the case with branch prediction (see [10]). 

The execution time of a sequence of triples also depends on the initial state $\gamma$ of the hardware $H$. Given 
a finite sequence $w = q_0 q_1 q_2 \cdots q_n \in (\mathcal{LI} \times 2^{\mathcal{M}} \times \mathbb{B})^*$ and an initial state $\gamma$ of $H$, $\mathrm{time}_H(\gamma, w)$ is the 
execution time of $w$ from initial state $\gamma$ of $H$. It can be defined precisely using for instance the HDL model 
of the hardware. Notice that at this point, we do not require sequences of triples to be actual sequences 
produced by program $P$. 



4.4 Trace Semantics of a Program 

The execution of a program can be defined by an alternating sequence of program states and instructions. 
A run of a program $P$ is a sequence $\varrho = s_0 \iota_0 s_1 \iota_1 s_2 \iota_2 \cdots s_{n-1} \iota_{n-1} s_n$ where each $s_k$ is a program 
state and $\iota_k = (\ell_k : i_k)$ is a labelled instruction with $\ell_k = s_k(pc)$ and such that $s_{k+1} = \langle \iota_k \rangle (s_k)$. We let 
$Runs(P)$ be the set of runs of $P$. 

The trace, $TR(\varrho)$, of the run $\varrho$ is the sequence $q_0 q_1 q_2 \cdots q_{n-1} \in (\mathcal{LI} \times 2^{\mathcal{M}} \times \mathbb{B})^*$ with $q_i = (\iota_i, A_i, d_i)$ 
where $A_i$ is the set of memory addresses referenced by instruction $\iota_i$ in state $s_i$ and $d_i \in \mathbb{B}$ indicates 
whether the instruction is actually executed.¹⁰ For instance, instruction ldr r0, [sp, #4]¹¹ from a 
program state $s$ with $s(sp) = 12$ references address 16 and is performed (unconditional). Instruction 
128: addle r1, r1, #1 is performed only if the last comparison set the predicate le to TRUE, and from program 
state $s$ the next triple in the trace is $(128{:}\ \text{addle r1, r1, \#1},\ \emptyset,\ s(le))$. As there are multiple load and 
store instructions, we need sets of addresses to represent the memory cells referenced by an instruction: 
instruction stm sp, {r0, r1}¹² references addresses 12 and 8. 
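The three examples above can be reproduced with a few lines of code. This is a minimal sketch, not the tool's implementation: `ldr_addresses`, `stm_addresses` and `triple` are hypothetical helpers, labels 124 and 132 are made up for the two instructions the text leaves unlabelled, and the store layout (one word at the base address, the next one 4 bytes below) follows the semantics given in footnote 12.

```python
def ldr_addresses(state, base, offset):
    """Address set referenced by a single-register load from [base, #offset]."""
    return frozenset({state[base] + offset})


def stm_addresses(state, base, regs):
    """Address set of a multiple store, one word per register, going down from base."""
    return frozenset(state[base] - 4 * k for k in range(len(regs)))


def triple(label, text, addresses, executed):
    """Build one trace element: (labelled instruction, address set, executed flag)."""
    return ((label, text), addresses, executed)


# Program state with sp = 12 and the predicate le currently false.
s = {"sp": 12, "le": False}

t_ldr = triple(124, "ldr r0, [sp, #4]", ldr_addresses(s, "sp", 4), True)
t_addle = triple(128, "addle r1, r1, #1", frozenset(), s["le"])
t_stm = triple(132, "stm sp, {r0, r1}", stm_addresses(s, "sp", ["r0", "r1"]), True)

for t in (t_ldr, t_addle, t_stm):
    print(t)
```

The conditional instruction still appears in the trace when its condition is false; only its third component changes, which is what lets the hardware model time a skipped instruction correctly.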

The execution time of a run $\varrho$ of $P$ from initial state $\gamma$ of $H$ is defined by $\mathrm{time}_H(\gamma, TR(\varrho))$. 

Program $P$ has a set of initial states $I$ (where $[\![pc]\!]$ gives the initial instruction of $P$) and the 
contents of the registers, predicates and main memory can be in a finite set of values. Notice that there can 
be many initial states as the input data of $P$ can range over large sets. $P$ also has a set of final states, 
$F$, and we assume it can be defined using the value of register pc which gives the last instruction of 
$P$. The language $\mathcal{L}_I(P)$ of $P$ is the set of traces generated by runs of $P$ that start in $I$ and end in $F$, 
i.e., $\mathcal{L}_I(P) = \{ TR(\varrho) \mid \varrho = s_0 \iota_0 \cdots s_{n-1} \iota_{n-1} s_n \in Runs(P),\ s_0 \in I,\ s_n \in F \}$. As we assume that $P$ 
always terminates for any input data, this language is finite (because the set of memory contents is finite). 

9 Some instructions (MUL/MLA/SMULL) have data dependent durations. In this case an upper bound can be used or 
a non-deterministically chosen value (see Section[8]for details). 

10 $A_i$ and $d_i$ can always be computed from $s_i$ and $\iota_i$. 

11 The semantics is $s(r_0) := s(s(sp) + 4)$. 

12 The semantics is $s(s(sp)) := s(r_1)$ and $s(s(sp) - 4) := s(r_0)$. 






5 Computation of the WCET 



5.1 Modular Definition of WCET 

Given a run $\varrho$ of $P$, the execution time of $\varrho$ on $H$ from state $\gamma$ only depends on $TR(\varrho)$. This implies that 
the WCET of $P$ only depends on $\mathcal{L}_I(P)$ and the initial state $\gamma$ of $H$. Consequently if $\mathcal{L}_I(P)$ is finite 

$$\mathrm{WCET}(P, H) = \max_{w \in \mathcal{L}_I(P)} \mathrm{time}_H(\gamma, w). \tag{1}$$

The computation of $\mathrm{WCET}(P, H)$ thus amounts to (i) generating $\mathcal{L}_I(P)$, (ii) feeding $H$ with each 
$w \in \mathcal{L}_I(P)$ and tracking the maximal execution time. This gives a modular way of computing $\mathrm{WCET}(P, H)$ 
since a generator for $\mathcal{L}_I(P)$ and the behaviour of the abstract hardware $H$ to be fed with $\mathcal{L}_I(P)$ can be 
given independently of each other. 
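The modularity of this definition can be seen in a few lines: the maximisation only needs a collection of traces and some function playing the role of the hardware timing. The sketch below is purely illustrative. `toy_time` is a made-up stand-in for the timing function (one cycle per triple plus a fixed penalty for the first access to each data address), not a model of the ARM920T, and the two traces are hand-written rather than generated from a program.

```python
def toy_time(gamma, w, miss_penalty=10):
    """Hypothetical timing function: gamma is the set of addresses already cached."""
    cached = set(gamma)
    cycles = 0
    for _instr, addresses, _executed in w:
        cycles += 1                      # one cycle per triple (assumption)
        for a in sorted(addresses):
            if a not in cached:
                cycles += miss_penalty   # first access to this address misses
                cached.add(a)
    return cycles


def wcet(language, time_h, gamma):
    """Equation (1): maximum of time_h(gamma, w) over a finite language of traces."""
    return max(time_h(gamma, w) for w in language)


w1 = [((0, "mov r1, #30"), frozenset(), True),
      ((4, "ldr r0, [sp, #4]"), frozenset({16}), True)]
w2 = w1 + [((8, "ldr r2, [sp, #4]"), frozenset({16}), True)]

print(wcet([w1, w2], toy_time, gamma=frozenset()))
print(wcet([w1, w2], toy_time, gamma=frozenset({16})))
```

Swapping `toy_time` for a different timing function changes nothing in `wcet`, and the trace generator never needs to know how the timing is computed.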

5.2 Extended Domain Abstraction 

In order to take into account all the possible values of the input data, we use an extended domain for the 
values of the main memory cells. We assume here that the values of the registers and predicates are known 
in the initial state. 

Let $\mathbb{D}_\perp = \mathbb{D} \cup \{\perp\}$ be the extended domain with $\perp$ the unknown value. The semantics of instructions 
is extended to this extended domain: for instance, the semantics of add r0, r1, #1 is given by 

$[\![r_0]\!] = \perp$ if $[\![r_1]\!] = \perp$, and $[\![r_1]\!] + 1$ otherwise. 

The semantics of comparison instructions, e.g., cmp r2, r1, is extended as well to $\mathbb{D}_\perp$, e.g., for instruction 
24 of program FIBO, 

$[\![eq]\!] = \perp$ if ($[\![r_2]\!] = \perp$ or $[\![r_1]\!] = \perp$), and ($[\![r_2]\!] = [\![r_1]\!]$) otherwise. 

When a conditional instruction is encountered and the condition is $\perp$, the extended semantics of the 
instruction considers two successors: one where the condition is TRUE and the other where the condition is 
FALSE. If a branching instruction like bx lr is encountered and $[\![lr]\!] = \perp$ the next instruction is undefined 
(we can encode this by jumping to a special "error" state but this situation will not occur in the sequel). 
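The extended semantics can be sketched directly. The code below is an illustration only: `BOT` stands for the unknown value, `add_ext`, `eq_ext` and `condition_successors` are hypothetical names, and the 32-bit wrap-around in the addition is an assumption about the value domain rather than something the text specifies.

```python
BOT = object()  # the unknown value of the extended domain


def add_ext(a, b):
    """Addition on the extended domain: unknown as soon as one operand is unknown."""
    if a is BOT or b is BOT:
        return BOT
    return (a + b) & 0xFFFFFFFF  # assumed 32-bit wrap-around


def eq_ext(a, b):
    """Equality test of a comparison instruction on the extended domain."""
    if a is BOT or b is BOT:
        return BOT
    return a == b


def condition_successors(cond):
    """Truth values to explore for a conditional instruction: both when unknown."""
    return [True, False] if cond is BOT else [cond]


print(add_ext(41, 1))
print(add_ext(BOT, 1) is BOT)
print(condition_successors(eq_ext(3, 31)))
print(condition_successors(eq_ext(BOT, 31)))
```

A symbolic simulator built on these helpers forks only at conditions that evaluate to the unknown value, so programs whose control flow does not depend on the input data still yield a single run.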

We may now define an extended symbolic semantics for a program $P$: starting from an initial state 
$s : \mathcal{R} \cup \mathcal{V} \cup \mathcal{M} \to \mathbb{D}_\perp$, the symbolic semantics defines a set of runs (non-determinism may arise if some 
conditions are tested and unknown). 

Assume that the values of the registers and predicates are fixed in the initial state and given by 
$s_0(\mathcal{R} \cup \mathcal{V})$ and the input data is $d$: the initial state of the memory is $s_0(d)$ with $s_0(d) : \mathcal{M} \to \mathbb{D}$. The initial 
state of the program is thus defined by $s_0(\mathcal{R} \cup \mathcal{V}) \cdot s_0(d)$. 

Define $s_0^\perp : \mathcal{R} \cup \mathcal{V} \cup \mathcal{M} \to \mathbb{D}_\perp$ by: $s_0^\perp(x) = s_0(x)$ for $x \in \mathcal{R} \cup \mathcal{V}$ and $s_0^\perp(y) = \perp$ for $y \in \mathcal{M}$. 

The important property of the extended semantics is ($\pi$): if $\varrho$ is a run of $P$ from state $s_0(d)$, then $\varrho$ is a 
run of $P$ from $s_0^\perp$ in the extended symbolic semantics. 

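In the sequel we write $\mathcal{L}_\perp(P)$ for $\mathcal{L}_{\{s_0^\perp\}}(P)$ and $\mathrm{WCET}_\perp(P, H) = \max_{w \in \mathcal{L}_\perp(P)} \mathrm{time}_H(\gamma, w)$. The 
property ($\pi$) of the symbolic semantics implies that $\mathcal{L}_I(P) \subseteq \mathcal{L}_\perp(P)$ and by language inclusion we have 

$$\mathrm{WCET}(P, H) = \max_{w \in \mathcal{L}_I(P)} \mathrm{time}_H(\gamma, w) \leq \max_{w \in \mathcal{L}_\perp(P)} \mathrm{time}_H(\gamma, w) = \mathrm{WCET}_\perp(P, H). \tag{2}$$

We can thus reduce the computation of (an upper bound of the) $\mathrm{WCET}(P, H)$ to a symbolic simulation of 
program $P$ on the extended domain $\mathbb{D}_\perp$ from a unique initial state $s_0^\perp$. 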

As we have assumed that termination does not depend on the input data, but is guaranteed for each 
program $P$, the symbolic simulation of $P$ on the extended domain terminates as well. Each test that ensures 
termination in $P$ cannot evaluate to $\perp$ because otherwise it would depend on the input data and this would 
contradict assumption (A1). 

5.3 WCET Computation as a Reachability Problem 



We can reduce the computation of the WCET to a reachability problem on a network of timed automata. Indeed, as $\mathcal{L}_\perp(P)$ is finite, it can be generated by a finite automaton Aut(P). The hardware H (including pipeline, caches and main memory) can be specified by a network of timed automata Aut(H) (formal models are given in Section 8). Feeding H with $\mathcal{L}_\perp(P)$ amounts to building the synchronised product Aut(H) × Aut(P). On this product we define final states to be the states where the last instruction of P flows out of the last stage of the pipeline. Assume a fresh clock x is reset in the initial state of Aut(H) × Aut(P). The WCET of P on H is then the largest value, max(x), that x can take in a final state of Aut(H) × Aut(P) (we assume that time does not progress from a final state).

We can compute max(x) using model-checking techniques with the tool UPPAAL [8] (see Section 9). To do this, we check a reachability property (R): "Can we reach a final state with x ≥ K?" on Aut(H) × Aut(P). If the property is true for K and false for K + 1, then K is the WCET of P. We can also compute this maximal value directly using the sup operator, which gives the maximal value a clock can have in a reachable state.
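The search for K described above can be sketched as follows. This is only an illustration under stated assumptions: `reach` is a hypothetical oracle standing in for a model-checker query of property (R), and `toy_reach` with its clock values is invented for the example. The oracle is monotone (if a final state is reachable with x ≥ K, it is also reachable with x ≥ K' for every K' ≤ K), which is what makes a binary search valid.

```python
from typing import Callable


def largest_true(reach: Callable[[int], bool]) -> int:
    """Largest K >= 0 such that reach(K) holds, for a monotone oracle.

    Assumes reach(0) is true and reach(K) is false for K large enough
    (the clock x is bounded in final states because the program terminates).
    """
    if not reach(0):
        raise ValueError("no final state is reachable")
    # Exponential search for an upper bound hi with reach(hi) false.
    lo, hi = 0, 1
    while reach(hi):
        lo, hi = hi, 2 * hi
    # Invariant: reach(lo) is true and reach(hi) is false.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reach(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical stand-in for the model-checker: the clock values observed
# in the final states of a toy product automaton.
FINAL_CLOCK_VALUES = [17, 42, 35]


def toy_reach(k: int) -> bool:
    """Property (R): can we reach a final state with x >= k?"""
    return any(x >= k for x in FINAL_CLOCK_VALUES)


wcet = largest_true(toy_reach)
```

Each call to the oracle corresponds to one full exploration of the product, so the number of queries grows only logarithmically with the WCET value.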

Notice that to do this we have to explore the whole state space of Aut(H) × Aut(P). This means that to handle large case studies, we need to reduce the state space as much as possible.

An important point to notice is that the tightness of the WCET we compute depends on an accurate 
description of H. The more precise (time-wise) Aut(H) is, the more precise the computed WCET will be. 
It is thus not reasonable to take a very abstract H (e.g., with caches that always miss) as it will give poor 
WCET estimates. We can still have some control on the automaton Aut(P) that generates the traces to be 
fed to Aut(H). Indeed, we should avoid generating two runs with the same trace as it will give the same 
WCET (from the same initial state of H). This means that minimizing Aut(P) can effectively reduce the 
state space (at least the number of paths explored in the product Aut(H) × Aut(P)). In the next section we describe how to compute a reduced program P' that generates the same set of traces as P.



Notes on Section 5.3: the clock x is not a clock of Aut(H). Checking that (R) is false, or computing the sup of a clock, requires the exploration of all the reachable states.

6 Slicing

Program slicing was introduced by Mark Weiser [32] in 1984. The purpose of program slicing is to compute a program slice (by removing some statements of the original program) s.t. the slice computes the same values for some variables at some given statements. Program slicing is often used for checking properties of programs. The reader is referred to [31] for a survey on the principles of (static and dynamic) slicing.

6.1 Overview of Program Slicing

In this section, we assume that we have the control flow graph of P, CFG(P), which is a directed graph whose nodes are the instructions of P. CFG(P) has a single entry node (the initial instruction of the program P) and a single exit node (that indicates the end of program P). An example of a CFG for the Fibonacci program of Listing 1.1 is given in Figure 5.

A slice criterion C for P consists of a subset of the instructions of P and, for each instruction $\iota$ in this subset, an associated subset of "variables" $V(\iota) \subseteq \mathcal{R} \cup \mathcal{V} \cup \mathcal{M}$. We assume that $V(\iota)$ is actually included in the set of registers that the instruction operates on, but this is inessential. For instance, a slice criterion for program FIBO of Figure 5 can be instruction 48: cmp r2, r1 and associated set $\{r_1, r_2\}$.

Given input data $d \in \mathcal{D}$, we write run(P, d) to denote the (unique) run of P on d. Let $S \subseteq P$. The runs of P and S on input data $d \in \mathcal{D}$ are denoted

Notice that at this stage, run(S, d) may not be finite and program S may not terminate.
[Figure 5: control flow graph of fib/fib-02.elf, from ENTRY to END. Nodes, by address: 0 mov r1,#30; 4 mov r2,#2; 8 add r1,r1,#1; 12 add r2,r2,#1; 16 mov r0,#0; 20 mov r3,#1; 24 cmp r2,r1; 28 add r0,r0,r3; 32 bxeq lr; 36 add r2,r2,#1; 40 add r2,r2,#1; 44 add r3,r3,r0; 48 cmp r2,r1; 52 add r0,r0,r3; 56 bne 24; 60 bx lr.]

Fig. 5. CFG for the Fibonacci Program (fib/fib-02.elf)
i.e., instructions not in the subset S are ignored (replaced by $\varepsilon$, the empty word) and for instructions in S we keep the projection on $V(\iota)$ of the program state. proj is extended in the natural way to traces and we let $\mathrm{proj}(\varepsilon) = \varepsilon$ and $\mathrm{proj}(w.(s, \iota)) = \mathrm{proj}(w).\mathrm{proj}(s, \iota)$ with s a program state and $\iota \in \mathcal{I}$.
S is a slice of P for the slice criterion C if it satisfies, for every input data $d \in \mathcal{D}$:

1. if P terminates on input d then S terminates on input d, and

2. proj(run(P, d)) = proj(run(S, d)). Notice that by definition of proj, all the instructions of S are in proj(run(S, d)) but the projection restricts the state $s'_k$ to the variables in $V(\iota'_k)$.
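The two conditions can be illustrated on a toy run. The sketch below is an assumption-laden reading of the definition, not the paper's formalisation: a run is modelled as a list of (state, instruction) pairs, instructions outside S are dropped, and instructions of S that have no associated variable set are kept with an empty projected state. The three-instruction program and its labels are hypothetical.

```python
def proj(run, slice_instrs, variables):
    """Project a run on the instructions of a slice.

    run: list of (state, instruction) pairs, state being a dict.
    slice_instrs: the set S of instructions kept in the slice.
    variables: maps an instruction to its set V(instruction); instructions
    of S without an entry are kept with an empty projected state.
    """
    out = []
    for state, instr in run:
        if instr not in slice_instrs:
            continue  # replaced by the empty word
        wanted = variables.get(instr, set())
        out.append(({v: state[v] for v in wanted}, instr))
    return out


# Hypothetical program P = a: mov r1,#1 ; b: mov r4,#7 ; c: cmp r2,r1
# and candidate slice S = {a, c} with criterion V(c) = {r1, r2}.
S = {"a", "c"}
V = {"c": {"r1", "r2"}}

run_P = [
    ({"r1": 0, "r2": 2, "r4": 0}, "a"),
    ({"r1": 1, "r2": 2, "r4": 0}, "b"),
    ({"r1": 1, "r2": 2, "r4": 7}, "c"),
]
run_S = [
    ({"r1": 0, "r2": 2, "r4": 0}, "a"),
    ({"r1": 1, "r2": 2, "r4": 0}, "c"),
]

same_projection = proj(run_P, S, V) == proj(run_S, S, V)
```

Here instruction b only changes r4, which is not observed by the criterion, so removing it leaves the projected run unchanged.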

In the sequel we recall how to (effectively) compute a slice for P given a slice criterion C. 

6.2 Prerequisites for Computing a Program Slice 

The computation of a slice is based on an iterative solution of dataflow equations on the set of relevant variables for each instruction in the CFG of P. The relevant variables for an instruction are the variables read from or written to by the instruction. Due to the particular nature of binary programs, the relevant variables of an instruction might not be explicit: consider again the instruction str r0, [sp, #4]. This instruction reads register r0 and writes to the "variable" which is the memory cell at location [sp] + 4. This location is not known at compile time. This instruction writes to the stack, which is a particular region of the main memory. Other instructions like 16: str r2, [r1, r3 lsl #2] might (read or) write arbitrary memory cells: in this case the memory cell with address [r1] + ([r3] << 2).
In our approach we make the following choices:

1. we consider that the content of the main memory outside the stack is always $\perp$; this means that we do not need to store the main memory content in the program state as it is constant.

2. by assumption (A2), every access to a stack value is via register sp. We use the term stack reference for instructions that read/write sp and main memory reference for the other memory accesses.

3. for an instruction which has a stack reference (e.g., str r0, [sp, #4]), we only know the actual offset at runtime. To define the referenced variables, we introduce a variable stack. This means that we track the stack content in the state of the program and this variable is updated by the instructions that make stack references. The previous instruction thus reads r0 and sp and writes to stack.

This enables us to define formally the sets of referenced and defined variables for each instruction, which is needed to compute a slice automatically.

Given an instruction $\iota \in \mathcal{I}$, the sets of read-from (REF) and written-to (DEF) variables are given by:

- for instructions that do not make main memory references or stack references, e.g., $\iota$ = add r2, r1, #1, we have $\mathrm{REF}(\iota) = \{r_1\}$ and $\mathrm{DEF}(\iota) = \{r_2\}$.

- for instructions that make stack references, e.g., $\iota$ = push(r0, r1), we define $\mathrm{REF}(\iota) = \{r_0, r_1, sp\}$ and $\mathrm{DEF}(\iota) = \{sp, stack\}$.

- for instructions that make main memory references, we assume the content of main memory is $\perp$. For an instruction like $\iota$ = str r2, [r1, r3 lsl #2] we thus have $\mathrm{REF}(\iota) = \{r_1, r_2, r_3\}$ and $\mathrm{DEF}(\iota) = \emptyset$. Indeed, even if the memory location [r1] + ([r3] << 2) is written to, the new content of the main memory does not depend on the values of the registers and thus we can omit it in the set of DEF variables.
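The three cases above can be written down as a small table-driven computation. This is a minimal sketch, assuming each instruction has already been decoded by hand into the registers it reads and writes plus the kind of memory it writes to; the `Instr` record and its field names are hypothetical and do not reflect the data structures of the authors' tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    regs_read: frozenset
    regs_written: frozenset
    mem_write: str = "none"  # "none", "stack" or "main" (hand-decoded)


def ref(i: Instr) -> set:
    """Variables read by the instruction."""
    return set(i.regs_read)


def define(i: Instr) -> set:
    """Variables written by the instruction.

    A stack write adds the abstract variable "stack"; a main memory write
    adds nothing because main memory content is always the unknown value.
    """
    written = set(i.regs_written)
    if i.mem_write == "stack":
        written.add("stack")
    return written


add = Instr("add r2, r1, #1", frozenset({"r1"}), frozenset({"r2"}))
push = Instr("push {r0, r1}", frozenset({"r0", "r1", "sp"}),
             frozenset({"sp"}), mem_write="stack")
store = Instr("str r2, [r1, r3 lsl #2]", frozenset({"r1", "r2", "r3"}),
              frozenset(), mem_write="main")
```

The asymmetry between the stack and main memory is deliberate: only the stack is tracked in the program state, so only stack writes create data dependences.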

6.3 Step 1: A Slice for Values of Register sp. 

The first task we perform on a binary program P is to compute the possible values of the stack references 
(values of sp). 

We can compute the possible values of the stack pointer sp for a given instruction using a slice criterion C: C contains all the instructions that read/write the variable sp, i.e., all the instructions $\iota$ s.t. $\mathrm{DEF}(\iota) \cap \{sp\} \neq \emptyset$ or $\mathrm{REF}(\iota) \cap \{sp\} \neq \emptyset$.

16 The operator << denotes the logical shift left. 
We compute a slice of P for C, $S_C(P)$, using the standard definition of data dependence and control dependence (see [31]). Once it is computed, we do a symbolic simulation of $S_C(P)$ and track the values of sp encountered for each instruction in C.

As we have assumed (A1) that termination does not depend on the input data, but is guaranteed for each program P, the symbolic simulation of the slice $S_C(P)$ on the extended domain terminates as well. During the course of the symbolic simulation, we track the values of the sp register for each stack reference instruction. At the end of the simulation, we obtain the set of possible values for sp at each stack reference instruction. Because of the property of the slice, and because the symbolic simulation in the extended domain yields a superset of the set of runs, we can ensure that the set of sp values we obtain for each instruction is a superset of the set of actual values in P.

Limitations. The previous approach works correctly if the stack is referenced only via the register sp, 
assumption (A2). This is ensured by the API of the compilers from C/C++ to ARM for instance and thus 
is a perfectly reasonable assumption. 

We can take advantage of the computation performed previously to narrow the DEF and REF variables for each instruction in P. Assume that for instruction $\iota$ = 4: str r0, [sp, #4], the set of possible values of sp is {12, 16}. What we know about the written-to variables is more precise than being somewhere in the stack. We know that variables at index 12 and 16 may be written to, and this instruction does not modify other stack items at other offsets. We thus refine the definitions of REF and DEF for instruction $\iota$ by setting: $\mathrm{REF}^*(\iota) = \{r_0, sp\}$ (unchanged in this case) and $\mathrm{DEF}^*(\iota) = \{stack_{12}, stack_{16}\}$. These more precise definitions result in smaller subsequent slices as they introduce fewer data dependences in the CFG of a program.

In the sequel, we show how to use program slicing to compute a WCET-equivalent program. In the 
next section, we also show how to iteratively use program slicing to build the CFG of arbitrary assembly 
(unstructured) programs. 

6.4 Step 2: Using Program Slicing to Compute a WCET-Equivalent Program 

As in the previous subsection, assume that we have the complete CFG of P (building this CFG is addressed in Section 7). Equation (1) implies that for any two programs P and P',

$$\mathcal{L}_\perp(P) = \mathcal{L}_\perp(P') \implies \mathrm{WCET}_\perp(P, H) = \mathrm{WCET}_\perp(P', H). \tag{3}$$

What we would like to do is to compute such a WCET-equivalent program P' which (hopefully) operates on a reduced subset of the set of registers $\mathcal{R}$ yet contains enough information to generate $\mathcal{L}_\perp(P)$.

Using the previously computed attributes REF* and DEF*, we can compute a WCET-equivalent program using an ad hoc slice criterion C': C' contains (i) all the instructions that perform main memory transactions (including the stack), each with the associated set of variables that defines the memory location, and (ii) all the conditional instructions $\iota$, with an associated set of variables $V(\iota) \ni p$ if p is the condition of the instruction (see footnote 17). For instance, instruction j = (16: ldr r2, [r1, r3 lsl #2]) is in C' and we have to track the values of registers $V(j) = \{r_1, r_3\}$ since the memory address is defined by r1 and r3. For an instruction like $\iota$ = (12: addle r1, r2, #1) we set $V(\iota) = \{le\}$.

Let $S_{C'}(P)$ be the slice computed using the criterion C'. What we want is to generate the language $\mathcal{L}_\perp(P)$ using the slice. For each instruction $\iota \in P$ we define a corresponding abstracted instruction $\alpha(\iota)$ as follows:

- if $\iota \in P \cap S_{C'}(P)$ then $\alpha(\iota) = \iota$;

- for the other instructions $\iota \in P \setminus S_{C'}(P)$, $\alpha(\iota) = \iota_{nop}$ where $\iota_{nop}$ denotes the instruction with the exact same syntax as $\iota$ but the semantics of $\iota_{nop}$ is [pc] := [pc] + 4. As the syntax of $\iota$ is identical to that of $\iota_{nop}$, this also preserves the REF and DEF attributes.

We let $\alpha(P)$ be the program that consists of the instructions $\alpha(\iota)$, $\iota \in P$. Notice that $\alpha$ is a one-to-one mapping and thus we can consider $\alpha^{-1}$ when needed.
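The abstraction can be pictured with a few lines of code. The sketch below is illustrative only: programs are dictionaries from addresses to instruction text, `toy_execute` is a hypothetical interpreter that understands nothing but `mov rX, #n`, and the three-instruction program is invented. An abstracted instruction keeps its text (so its syntax, and hence what is sent to the hardware model, is unchanged) but is either simulated or treated as a no-op that only advances pc.

```python
def alpha(program, slice_addrs):
    """Map each address to (instruction text, simulate?).

    Instructions outside the slice keep their syntax but will be given the
    no-op semantics [pc] := [pc] + 4.
    """
    return {addr: (text, addr in slice_addrs) for addr, text in program.items()}


def step(abstract_program, pc, state, execute):
    """Perform one step of the abstracted program."""
    text, simulated = abstract_program[pc]
    if simulated:
        return execute(text, pc, state)
    return pc + 4, state  # no-op: only the program counter changes


def toy_execute(text, pc, state):
    """Hypothetical interpreter handling only 'mov rX, #n'."""
    op, args = text.split(" ", 1)
    if op == "mov":
        reg, imm = [a.strip() for a in args.split(",")]
        state = {**state, reg: int(imm.lstrip("#"))}
    return pc + 4, state


program = {0: "mov r4, #7", 4: "mov r2, #2", 8: "cmp r2, r1"}
abstract = alpha(program, slice_addrs={4, 8})

pc, state = step(abstract, 0, {}, toy_execute)        # not in the slice
pc2, state2 = step(abstract, pc, state, toy_execute)  # in the slice
```

Since every address of the program is still present in the abstracted program, the mapping is one-to-one and the two control flow graphs have the same shape.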
We can now prove the following Lemmas: 

17 For a conditional memory transaction instruction, both the registers that are needed to compute the referenced 
memory address(es) and the condition are in the associated variables. 






Lemma 1. Let $\varrho = s_0 \iota_0 s_1 \iota_1 s_2 \iota_2 \cdots \iota_{k-1} s_k$ be a run of P. The run $\varrho' = s'_0 \alpha(\iota_0) s'_1 \alpha(\iota_1) s'_2 \alpha(\iota_2) \cdots \alpha(\iota_{k-1}) s'_k$ is in $\alpha(P)$ and $TR(\varrho) = TR(\varrho')$.

Proof. We prove the Lemma by induction. The induction hypothesis (IH) is: for runs of length k, $TR(\varrho) = TR(\varrho')$ and $s'_k = \mathrm{proj}_{V(\iota_k)}(s_k)$ if $\iota_k$ is the instruction following $\iota_{k-1}$ and $\iota_k$ is in the slice, and $s'_k = \mathrm{proj}_{\{pc\}}(s_k)$ otherwise. The Lemma is true for runs of length 0. Assume we have a run of length k + 1, i.e., $\varrho = s_0 \iota_0 s_1 \iota_1 s_2 \iota_2 \cdots \iota_{k-1} s_k \iota_k s_{k+1}$. First notice that instruction $\alpha(\iota_k)$ is a successor of $\alpha(\iota_{k-1})$ as the CFG of $\alpha(P)$ is isomorphic to CFG(P). We can compute the triples $(x, A, d)$ and $(x', A', d')$ added to the traces of $\varrho$ and $\varrho'$ after instructions $\iota_k$ and $\alpha(\iota_k)$:

- the first component of the triple is the same, as $\alpha(\iota_k)$ and $\iota_k$ have exactly the same syntax (and location); hence $x = x'$.

- for the second component, memory references, there are two cases:

  • either $\iota_k$ does not make any memory transfer and references only registers. The same applies to $\alpha(\iota_k)$ and the second component is the empty set;

  • or $\iota_k$ has memory references. In this case, the registers that generate the memory references are in the slice, and thus the values at $s_k$ and $s'_k$ coincide by the "projection" property of the slice.

  In each case $A = A'$.

- the third components $d, d'$ are the values of the conditions of the instructions $\iota_k$ and $\alpha(\iota_k)$. If the instruction $\iota_k$ is unconditional, $d = \mathrm{TRUE}$ and $d' = \mathrm{TRUE}$. Otherwise, the two instructions have the same condition c. As $\iota_k$ is conditional, the condition is in the slice (by definition of the slice) and thus $s_k(c) = s'_k(c)$. Hence $d = d'$.

This proves that $TR(\varrho) = TR(\varrho')$ and completes the proof. □

Lemma 2. Let $\varrho' = s'_0 \iota'_0 s'_1 \iota'_1 s'_2 \iota'_2 \cdots \iota'_{k-1} s'_k$ be a run of $\alpha(P)$. There is a run $\varrho = s_0 \iota_0 s_1 \iota_1 s_2 \iota_2 \cdots \iota_{k-1} s_k$ of P with $\iota'_j = \alpha(\iota_j)$ and $TR(\varrho) = TR(\varrho')$.

Proof. The proof relies on the following fact: every instruction in P that has more than one successor is conditional and thus is in the slice (see footnote 18). Consequently, given two instructions $\iota'_j = \alpha(\iota_j)$ and $\iota'_{j+1} = \alpha(\iota_{j+1})$ in the slice, there is a unique sequence (with no loop) of instructions in CFG(P) between $\iota_j$ and $\iota_{j+1}$. This shows that there is a (unique) run in P defined by $\iota_j = \alpha^{-1}(\iota'_j)$. Using the result of Lemma 1 completes the proof. □

By combining Lemmas 1 and 2 we obtain:
Theorem 1. $\mathrm{WCET}_\perp(P, H) = \mathrm{WCET}_\perp(\alpha(P), H)$.

Proof. Lemmas 1 and 2 imply that $\mathcal{L}_\perp(P) = \mathcal{L}_\perp(\alpha(P))$ and by Equation (3) the result follows. □

When we do program slicing, many operations on registers are avoided if they do not influence the control flow. The result is that $\alpha(P)$ generates fewer states than P: assume register $r_4$ is never used in $\alpha(P)$ but is used in P; then all the states of P that differ only on $r_4$ are collapsed into the same state. This also means that the automaton $Aut(\alpha(P))$ that generates $\mathcal{L}_\perp(\alpha(P))$ has fewer states than Aut(P). Quite often, some registers are not used at all or do not influence the control flow, and this drastically reduces the number of states in $Aut(\alpha(P))$.
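The collapsing effect can be quantified on a toy state space. The numbers below are hypothetical: states are triples (pc, r2, r4) with two program points, two values of r2 and eight values of r4; dropping the unused register r4 merges all states that differ only on it.

```python
from itertools import product

# Hypothetical reachable states of P: every combination of program counter,
# r2 and r4 over small illustrative ranges.
states_P = set(product([0, 4], [0, 1], range(8)))

# In the sliced program r4 is never used, so it is not part of the state.
states_sliced = {(pc, r2) for pc, r2, _r4 in states_P}

reduction_factor = len(states_P) // len(states_sliced)
```

In this toy case the reduction factor equals the number of values r4 can take, which is why removing even one irrelevant register can shrink the automaton considerably.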

An example of a slice is given in Figure 6 for the Fibonacci program FIBO$_0$ compiled with option -O0: only 12 instructions out of 40 actually need to be simulated, and the variables in the sliced program are $\{pc, r_0, r_2, r_3\}$ and 3 stack values.

Another advantage of slicing is that we do not need to do loop unrolling because the registers and 
instructions that control the loop bounds are automatically preserved by the slice. 

In Table 1, Section 10, column "Abs" (a/b) gives, for each program P, the number of nodes a for which the simulation of an instruction is needed compared to the total number of nodes b of Aut(P).

This reduction has not only an effect on the state space (reduction of the number of paths explored) but 
also on the size of the representation of each state of $Aut(\alpha(P))$.

In the next section, we describe how we automatically compute the CFG of a program. 



18 We omit here the case of switch statements but they are processed in a similar way and this is implemented in our 
tool. 
[Figure 6: control flow graph, from ENTRY to END, of the functions main and fib of Listing 1.2 (instructions 0 to 160).]

Fig. 6. WCET-equivalent Slice for Program FIBO$_0$.
7 Computation of the CFG

To compute the CFG of a program, we iterate two phases:

1. Slice: we slice a partial CFG in order to compute the dynamically computed branch targets; we simulate the sliced program to determine these targets.

2. Expand: having determined the dynamically computed branch targets, we expand the partial CFG and repeat Step 1.

When the iteration terminates we have the CFG of the program. We limit the scope of our tool to non-recursive programs, and this ensures that the previous iterative computation terminates.

We describe the process on an example, the Fibonacci program FIBO$_0$ (compiled with option -O0) given in Listing 1.2. This program is composed of two functions, main and fib: main calls fib and at the end, fib returns. The computation goes like this: after instruction 8c in main (addresses in this paragraph are hexadecimal; 8c is instruction 140 of Listing 1.2), fib starts, as 8c is "(b)ranch to and save return address to (l)ink register lr". If at some point instruction 74 in fib (decimal 116) is reached, lr should contain the return address in main, i.e., 90 (decimal 144). It should also be noticed that the first instruction in main saves on the stack the return address of the caller: push (lr). This is used at the end of main to return to the caller's next instruction when the statement bx lr ("branch to the content of lr") is performed right after popping the value of lr.



    00000000 <fib>:
       0:  e24dd020   sub   sp, sp, #32
       4:  e58d0004   str   r0, [sp, #4]
       8:  e3a03001   mov   r3, #1
      12:  e58d3010   str   r3, [sp, #16]
      16:  e3a03000   mov   r3, #0
      20:  e58d3014   str   r3, [sp, #20]
      24:  e3a03002   mov   r3, #2
      28:  e58d300c   str   r3, [sp, #12]
      32:  ea00000a   b     50 <fib+0x50>
      36:  e59d3010   ldr   r3, [sp, #16]
      40:  e58d3018   str   r3, [sp, #24]
      44:  e59d2010   ldr   r2, [sp, #16]
      48:  e59d3014   ldr   r3, [sp, #20]
      52:  e0823003   add   r3, r2, r3
      56:  e58d3010   str   r3, [sp, #16]
      60:  e59d3018   ldr   r3, [sp, #24]
      64:  e58d3014   str   r3, [sp, #20]
      68:  e59d300c   ldr   r3, [sp, #12]
      72:  e2833001   add   r3, r3, #1
      76:  e58d300c   str   r3, [sp, #12]
      80:  e59d200c   ldr   r2, [sp, #12]
      84:  e59d3004   ldr   r3, [sp, #4]
      88:  e1520003   cmp   r2, r3
      92:  dafffff0   ble   24 <fib+0x24>
      96:  e59d3010   ldr   r3, [sp, #16]
     100:  e58d301c   str   r3, [sp, #28]
     104:  e59d301c   ldr   r3, [sp, #28]
     108:  e1a00003   mov   r0, r3
     112:  e28dd020   add   sp, sp, #32
     116:  e12fff1e   bx    lr

    00000078 <main>:
     120:  e52de004   push  {lr}          ; stmdb sp!, {lr}
     124:  e24dd00c   sub   sp, sp, #12
     128:  e3a03f4b   mov   r3, #300
     132:  e58d3004   str   r3, [sp, #4]
     136:  e59d0004   ldr   r0, [sp, #4]
     140:  ebffffdb   bl    0 <fib>
     144:  e1a03000   mov   r3, r0
     148:  e1a00003   mov   r0, r3
     152:  e28dd00c   add   sp, sp, #12
     156:  e49de004   pop   {lr}          ; ldmia sp!, {lr}
     160:  e12fff1e   bx    lr

Listing 1.2. FIBO$_0$
If we perform a first unfolding of the program, we obtain the partial CFG depicted in Fig. 7. In this CFG, the successor of instruction 116 is unknown and thus the unfolding has a terminal node at this location. To compute the successor of this instruction we slice the partial CFG with the slice criterion $C'' = \{116\}$ and $V(116) = \{lr\}$. The sliced program is composed of the red nodes, i.e., instructions 140 and 116. Simulating this two-instruction program we get the possible value of lr at instruction 116, which is 144.
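The slice-then-simulate step for instruction 116 can be reproduced on a simplified model. The sketch below is not the authors' algorithm: it follows a single hand-encoded path through the partial CFG, uses data dependences only (control dependences and loops are ignored), and the REF/DEF sets are written by hand for a subset of the instructions of Listing 1.2.

```python
# (address, text, REF, DEF) along one path from the entry of main to 116.
PATH = [
    (120, "push {lr}", {"lr", "sp"}, {"sp", "stack"}),
    (124, "sub sp, sp, #12", {"sp"}, {"sp"}),
    (128, "mov r3, #300", set(), {"r3"}),
    (132, "str r3, [sp, #4]", {"r3", "sp"}, {"stack"}),
    (136, "ldr r0, [sp, #4]", {"sp", "stack"}, {"r0"}),
    (140, "bl fib", set(), {"lr"}),
    (0, "sub sp, sp, #32", {"sp"}, {"sp"}),
    (4, "str r0, [sp, #4]", {"r0", "sp"}, {"stack"}),
    (112, "add sp, sp, #32", {"sp"}, {"sp"}),
    (116, "bx lr", {"lr"}, set()),
]


def backward_slice(path, criterion_addr, criterion_vars):
    """Backward data-dependence slice along a straight-line path."""
    idx = max(i for i, ins in enumerate(path) if ins[0] == criterion_addr)
    kept = [criterion_addr]
    relevant = set(criterion_vars)
    for addr, _text, ref, deff in reversed(path[:idx]):
        if deff & relevant:
            kept.append(addr)
            # Variables defined here are resolved; those read become relevant.
            relevant = (relevant - deff) | ref
    return list(reversed(kept))


def simulate_bx_target(path, slice_addrs):
    """Simulate only the sliced instructions and return the target of bx lr."""
    lr = None
    for addr, text, _ref, _deff in path:
        if addr not in slice_addrs:
            continue
        if text.startswith("bl "):
            lr = addr + 4  # bl saves the return address in lr
        elif text == "bx lr":
            return lr
    return None


slice_116 = backward_slice(PATH, 116, {"lr"})
target_116 = simulate_bx_target(PATH, set(slice_116))
```

Only `bl` defines lr on this path, so everything else is dropped from the slice and the simulation has two instructions to execute.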

We can then extend the partial CFG to obtain the graph depicted in Fig. 8. We slice again to compute the successor of instruction 160: the new slice (6 nodes) is depicted in Fig. 8 with the red nodes. We should find here that main hands the control back to its caller. To recognise this situation we use the following trick: we assume that before the first instruction of the program is performed, $[lr] = \beta$ where $\beta$ is a special value that cannot correspond to any valid instruction. We can take for example $\beta = 3$. When we compute a target which is $\beta$ we know that we have reached the end of the program because this returns to the caller. This situation occurs when we simulate the second slice: after the instruction 160: bx lr the program returns to the caller.
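The sentinel trick can be illustrated by following lr through the call. This is a hand-written trace of the relevant instructions of Listing 1.2, not the output of a slicer; the function names are hypothetical, and the remark that 3 cannot be an instruction address relies on ARM (non-Thumb) instructions being word aligned.

```python
BETA = 3  # sentinel: ARM (non-Thumb) instruction addresses are multiples of 4


def trace_link_register():
    """Follow lr and the saved return address through main and fib."""
    lr = BETA           # value assumed before the first instruction
    stack = []
    stack.append(lr)    # 120: push {lr}
    lr = 140 + 4        # 140: bl fib (return address saved in lr)
    target_116 = lr     # 116: bx lr (return from fib)
    lr = stack.pop()    # 156: pop {lr}
    target_160 = lr     # 160: bx lr (return from main)
    return target_116, target_160


def is_program_exit(target):
    """A computed branch target equal to the sentinel marks the program end."""
    return target == BETA


target_116, target_160 = trace_link_register()
```

The first computed target is an ordinary successor (144), while the second one is the sentinel, which tells the CFG construction to add the exit node instead of expanding further.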

The complete CFG for FIBO$_0$ is given in Fig. 9.

The computation of the possible values of sp described in Section 6.3 is actually performed when computing the CFG. When we have computed the final CFG we also have the possible values of register sp (at the stack reference nodes) and we can directly proceed to Step 2 (Section 6.4) to compute a WCET-equivalent program.

The previous process always converges to the CFG of a program because we assume that the programs do not contain recursive calls (assumption (A4)). In the worst case, the slices we need to simulate in the iterative computation are the full CFGs obtained at each step.



8 Hardware Model 

In this section we present some features of the formal models (timed automata) of the hardware. The 
automata are given using the UPPAAL syntax: initial locations are identified by double circles, guards are 
green, synchronization signals (channels) are light blue and assignments are dark blue. A C in a location 
means committed: when an automaton enters a committed location, it cannot be interrupted and proceeds 
immediately to one of the successors of this location (the guards determine the transitions that can be 
taken). The UPPAAL models are available from http://www.irccyn.fr/franck/wcet



8.1 Main Memory 



The main memory model is a very simple two-location automaton (Fig. 10). When a memory transfer is required, signal MainMemStart? is received and clock t is reset. After a delay of MAINMEMTRANS the transfer is completed and signal MainMemEnd! is issued. Main memory transfers are triggered by either the instruction cache or the data cache, and accesses to main memory are serialized.



8.2 Caches 

The model of the instruction cache is given in Fig. 11. The state of the cache contains an array (64 x 8 array) to record the addresses stored in the cache and whether a line is dirty or not.

The instruction cache is simpler than the data cache because no write can occur in this cache, so a line cannot be dirty. After the initialization of the cache (the initial state of the cache is set by the function initCache()), the automaton is ready to receive the signal CacheReadStart[num]?. This signal is triggered by the fetch stage of the pipeline (Fig. 12). The memory address to read is m. If m is in the cache (function is_in(m) returns TRUE), there is no need for a memory transfer and variable PMT (Pending Memory Transfers) is assigned 0. Otherwise function insert(m) inserts m in the cache and returns the number of memory transfers to be performed: for the instruction cache it is always 1 because a line cannot be dirty (see Section 3), but for the data cache it can be either 1 or 2 if a dirty line has to be saved from the cache. As soon as the memory transfers are completed (PMT=0), transition Hurry! is fired (it is urgent). Then, after CACHE_SPEED time units (the value is 1 for our testbed) the read request completes and the signal CacheReadEnd[num]! is issued.
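The bookkeeping done by is_in and insert can be sketched as below. Several points are assumptions made for the illustration and are not taken from the UPPAAL models: the class and parameter names, the address-to-set mapping, the 32-byte line size, the reading of "64 x 8" as 8 sets of 64 ways, and the round-robin replacement policy. Only the return convention (0 on a hit, otherwise the number of main memory transfers, 2 when a dirty line must be written back) follows the text.

```python
class CacheModel:
    """Tag store of a set-associative cache with round-robin replacement."""

    def __init__(self, n_sets=8, n_ways=64, line_bytes=32):
        self.n_sets, self.n_ways, self.line_bytes = n_sets, n_ways, line_bytes
        self.lines = [[None] * n_ways for _ in range(n_sets)]
        self.dirty = [[False] * n_ways for _ in range(n_sets)]
        self.next_victim = [0] * n_sets

    def _locate(self, m):
        line = m // self.line_bytes
        return line % self.n_sets, line

    def is_in(self, m):
        s, line = self._locate(m)
        return line in self.lines[s]

    def insert(self, m, write=False):
        """Insert the line of m; return the number of memory transfers."""
        s, line = self._locate(m)
        way = self.next_victim[s]
        self.next_victim[s] = (way + 1) % self.n_ways
        transfers = 2 if self.dirty[s][way] else 1  # write back, then fill
        self.lines[s][way] = line
        self.dirty[s][way] = write
        return transfers

    def read(self, m):
        """Pending memory transfers for a read: 0 on a hit."""
        return 0 if self.is_in(m) else self.insert(m)


# A tiny instruction cache (1 set, 2 ways, 4-byte lines) to exercise the code.
icache = CacheModel(n_sets=1, n_ways=2, line_bytes=4)
pmt_sequence = [icache.read(a) for a in (0, 0, 4, 8, 0)]

# A one-line data cache: evicting a dirty line costs two transfers.
dcache = CacheModel(n_sets=1, n_ways=1, line_bytes=4)
first = dcache.insert(0, write=True)
second = dcache.insert(4)
```

For a cache that is never written, `insert` always returns 1, which matches the remark that an instruction cache line cannot be dirty.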
[Figure 7: partial control flow graph of Listing 1.2, from ENTRY through main (120 to 140) and fib (0 to 116); the node after 116: bx lr is an unresolved terminal node.]

Fig. 7. First Unfolding of the CFG of FIBO$_0$.

[Figure 8: partial control flow graph of Listing 1.2 extended with instructions 144 to 160 of main; the node after 160: bx lr is an unresolved terminal node.]

Fig. 8. Second Unfolding of the CFG of FIBO$_0$.

[Figure 9: complete control flow graph of Listing 1.2, from ENTRY to END.]

Fig. 9. The Complete CFG of FIBO$_0$.
[Figure 10: two-location timed automaton. Labels: MainMemStart?, t=0; invariant t<=MAINMEMTRANS; guard t==MAINMEMTRANS with MainMemEnd!.]

Fig. 10. Main Memory TA

[Figure 11: timed automaton of the instruction cache. Labels: initialize?, initCache(); CacheReadStart[num]?, PMT=is_in(m)?0:insert(m); guard PMT>0 && m>=0 with MainMemStart!, ICcachemiss++; MainMemEnd?, PMT--; guard PMT==0 with Hurry!, x=0; invariant x<=CACHE_SPEED; guard x==CACHE_SPEED with CacheReadEnd[num]!.]

Fig. 11. Instruction Cache
The data cache is a bit more involved (Fig. 13). For a read/hit operation it behaves almost like the instruction cache described above. For write operations, a write buffer (not given here) is used; moreover the timing depends on the type (load/store), the addresses involved in the operation, and whether another write/read operation is already in progress (and to which line in the write buffer). We have tried to design an accurate model of the data cache: data cache operations are the major factor in the WCET for most of the programs and a faithful model is required to compute tight bounds. How the model of the data cache was built is described in Section 10.3.



8.3 Pipeline Model 

The model of the pipeline is rather simple, except for the memory stage (M) which is a bit more complicated. The F stage automaton fetches the next instruction if no branch delay stall occurs (see Section 10.3). The function stall() of the F stage automaton determines whether such a stall should occur or not. If the next instruction can be fetched, it is fetched from the instruction cache via CacheReadStart[INSTR_CACHE]! (this signal is urgent and synchronized with the instruction cache). When the fetch is completed the instruction is transferred to the next stage, D, as soon as that stage is ready to be fed with a new instruction. The D stage, E stage and W stage are similar. Notice that the duration of an instruction may vary from one instruction to the other (e.g., a long multiplication may take longer than an addition) or because a conditional instruction is not executed: the actual duration is set when a new instruction arrives in the E stage (DUR_INSTR=dur()). A special signal prog_completed? is received from the program and marks the last instruction of the program. The program is completed when this last instruction flows out of the last stage (W) of the pipeline, and this corresponds to reaching location DONE of the W stage. The automaton for the M stage is given in Fig. 13: when an instruction is performed and it is a memory transaction, it issues a sequence of read/write requests to the data cache.



9 Implementation 

We have implemented the construction of the CFG (Section 7) and the computation of the WCET-equivalent program (Section 6.4). The architecture of our tool is given in Fig. 14. Together with a parser of ARM binary programs it comprises several thousand lines of C++ code. We have implemented very efficient versions of post-dominators algorithms [23, 18] and post-dominance frontiers algorithms [13] as they are used intensively both in Compute CFG and Compute WCET-equiv. To obtain the binary program we use the GCC tool suite (gcc, objdump) from Codesourcery [12].

Our tool produces a bundle of files: a ready-to-analyse file containing the UPPAAL timed automata models of the program P' and the hardware model; CFG(P'), a dot file with the graph of P'; and a ready-to-compile C++ file that contains a simulator of the program P'. This last file can be compiled and used to compute useful information like the ranges of registers. Notice that during the first phase, Compute CFG, we compute the range of the stack pointer and thus the tool can also be used as a stack analyser. To compute the WCET we check property R(K) (Section 5.3) using UPPAAL.

For the binary programs we have analysed, the time it takes to compute the output file from a binary program is negligible (less than a second). The automata of the programs of Table 1 and the dot graphs are available from http://www.irccyn.fr/franck/wcet

10 Experiments 
10.1 Methodology 

The program P to analyse is encapsulated in a template function: an example of use is given for program
FIBO in Listing 1.3.



The layout of the CFG is produced using dot, http://www.graphviz.org/
[Figure: one timed automaton per pipeline stage (F, D, E, W). The stages synchronise on the channels
decode, execute and writeback and on the corresponding *_completed signals; the F stage accesses the
instruction cache through CacheReadStart[INSTR_CACHE] and CacheReadEnd[INSTR_CACHE].]

Fig. 12. Timed Automata for the Fetch, Decode, Execute and WriteBack Stages.
[Figure: timed automaton of the M stage. For a memory transaction it issues CacheReadStart[DATA_CACHE]
(when is_ldx() holds) or CacheWriteStart[DATA_CACHE] (otherwise) requests, decrementing num_word[me]
until no word remains, and then signals memory_completed.]

Fig. 13. Memory Stage and Data Cache
[Figure: Binary Program P -> Compute CFG -> CFG(P) -> Compute WCET-equiv -> P' -> UPPAAL CFG(P')]

Fig. 14. Tool Chain Overview



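#define timerToCPUClockRatio 12   /* CPU cycles per hardware timer tick */

int main(void)
{
    int result;
    unsigned int start;
    unsigned int stop;

    start = timerGetValue(1);     /* read the hardware timer before the call */
    result = fib(300);            /* program P under measurement */
    stop = timerGetValue(1);      /* read the hardware timer after the call */

    /* convert timer ticks to CPU cycles and print the result */
    printf("fib(300) = %d, time=%lu\n", result,
           (unsigned long)(stop - start) * timerToCPUClockRatio);

    while (1);                    /* bare-metal program: never return */
}

Listing 1.3. Code snippet of instrumentation with FIBO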



Given P, we let t(P) be the encapsulated program. Measuring the execution time of P consists in (1)
reading a hardware timer (timerGetValue) into a start variable, (2) calling the program P, (3)
reading the timer again into a stop variable, and (4) printing (see footnote 20) the difference stop - start. The function
timerGetValue (assembly code) has been designed to read a hardware timer (see next paragraph).

The measurement error is +/-12 processor cycles. The program t(P) is compiled and linked. Running
it on the ARM9 prints out the number of cycles taken by the program P: this figure is given in column
"Measured WCET" in Table 1.

To faithfully compute the WCET of P using our method, we take t(P) as the input of our tool chain. t(P)
is transformed (using Compute CFG and Slice) into an UPPAAL automaton as described in Section 9. In
this automaton a dedicated clock GBL_CLK is reset when the instruction (see footnote 21) of t(P) that reads the hardware
timer flows out of the M stage (reading the timer in function timerGetValue is done using a load
instruction). The final state of the automaton is reached when the second occurrence of the instruction that
reads the timer flows out of the W stage. The computed WCET is given in column "Computed WCET" in
Table 1. Column "UPPAAL" in Table 1 gives the time UPPAAL takes to check the reachability property
"Is it possible to reach a final state with GBL_CLK ≥ K + 1?" for the value of K such that this property is
false whereas it is true for K. In this case K is the computed WCET.
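The search for the value K can be organised as a binary search over the clock bound. The sketch below is only an illustration under stated assumptions, not the tool described in this paper: `reachable(k)` is a hypothetical oracle standing for one model-checker query ("can a final state be reached with GBL_CLK ≥ k?"), and it is assumed to hold for k = 0 and to be monotone in k.

```python
from typing import Callable


def find_wcet(reachable: Callable[[int], bool]) -> int:
    """Return the largest k for which reachable(k) holds.

    reachable(k) is a hypothetical oracle (one model-checker query):
    "can a final state be reached with GBL_CLK >= k?".  It is assumed
    to be true for k = 0 and monotone (true up to the WCET, then false).
    """
    # Phase 1: double an upper bound until the property becomes false.
    low, high = 0, 1
    while reachable(high):
        low, high = high, high * 2
    # Invariant: reachable(low) is true and reachable(high) is false.
    # Phase 2: binary search between low and high.
    while high - low > 1:
        mid = (low + high) // 2
        if reachable(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in oracle for a program whose worst case is 8098 cycles.
def fake_oracle(k: int) -> bool:
    return k <= 8098


wcet = find_wcet(fake_oracle)
```

With this stand-in oracle the search stops at K = 8098, where the property holds for K and fails for K + 1; the number of queries grows only logarithmically with the WCET.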



10.2 Measuring Time on the Hardware 

Measuring execution time on the hardware may be done by using an external device like an oscilloscope
or by using one of the embedded hardware timers. In both cases, the program must be instrumented. In the
first case, using a General Purpose I/O (GPIO) device, a signal is set to 1 at the start of the measurement and to
0 at the end, and the oscilloscope measures the time between the rising and the falling edge. In the second
case, a free-running timer is launched. It is read at the start and at the end of the measurement, and the difference
between the two values gives the execution time. This supposes that the clock frequency of the hardware timer is
close enough to the clock frequency of the processor to allow accurate measurements. By "close enough" we mean
that the measurement error is less than +/-1% of the measurement. So a hardware timer clock frequency two
orders of magnitude lower than the processor clock frequency would be accurate enough if the program to
measure executes in > 10000 cycles.

20 The Armadeus APF9328 board has a serial interface, in-ROM drivers and a printf function.

21 We can identify this instruction in timerGetValue.

On the MC9328MXL the maximum available frequency for the hardware timers is 1/12th of the processor
clock frequency. So a program executing in > 1200 cycles may be accurately measured (less than 1%
error).
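The two thresholds quoted above (10000 and 1200 cycles) follow from the same arithmetic, which can be sketched as below. The function names are hypothetical, and the sketch assumes that the absolute measurement error is one timer tick, i.e. as many CPU cycles as the timer-to-CPU clock ratio.

```python
import math


def min_measurable_cycles(timer_to_cpu_ratio: int,
                          max_error_percent: float = 1.0) -> int:
    """Smallest execution time (in CPU cycles) whose relative error stays
    within max_error_percent, assuming an absolute error of one timer
    tick, i.e. timer_to_cpu_ratio CPU cycles."""
    return math.ceil(timer_to_cpu_ratio * 100 / max_error_percent)


def relative_error_percent(timer_to_cpu_ratio: int, cycles: int) -> float:
    """Relative measurement error (in percent) for a run of `cycles` cycles."""
    return 100.0 * timer_to_cpu_ratio / cycles


threshold_mxl = min_measurable_cycles(12)       # timer at 1/12th of the CPU clock
threshold_generic = min_measurable_cycles(100)  # timer two orders of magnitude slower
```

For example, under the same assumption a run of about 1000 cycles measured with the 1/12th timer has a relative error of 1.2%, above the 1% target.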



10.3 Tuning the Hardware Model 

The ARM9TDMI Technical Reference Manual [2] gives pipeline timings according to the kind of instructions,
together with some examples of load delays and branch delays. However, this timing information
about the ARM920T processor and the MC9328MXL micro-controller is not enough to design accurate
formal models of the hardware.

To overcome this, we have carefully crafted programs to stress particular features of the hardware 
and determine the precise timing of some sequences of instructions. The basis of this identification phase 
consists in measuring the difference in execution times of two variants of the same loop. The second variant 
contains a sequence of instructions for which we want a precise timing. The execution time difference 
between the two variants is the execution time of this sequence multiplied by the number of iterations. 
Using a large number of iterations minimizes the measurement error. 

For memory accesses, variants may differ only in the memory alignment of data, because timings may
differ depending on whether a subsequent cache access falls in the same cache set or in a distinct one.
If this is not modelled properly, it can have a huge impact on the computed WCET.

To remove the execution time of the measurement code, the loop is executed twice, once with 10000
iterations and once with 20000 (for instance). The difference between the two execution times is the execution
time of 10000 iterations. The loop is first run once (a dry run) to load it into the instruction cache.
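The differential measurement described above can be sketched as follows. The numbers are illustrative only: `simulated_total` is a hypothetical stand-in for a hardware measurement, and the 150-cycle overhead is an arbitrary assumption chosen to show that any constant cost cancels out.

```python
def cycles_per_iteration(cycles_short: int, cycles_long: int,
                         iters_short: int, iters_long: int) -> float:
    """Per-iteration cost obtained from two runs of the same loop.

    Any constant cost (timer reads, call/return, loop set-up) is present
    in both totals and cancels out in the difference.
    """
    return (cycles_long - cycles_short) / (iters_long - iters_short)


def simulated_total(iterations: int, body_cycles: int, overhead_cycles: int) -> int:
    """Hypothetical stand-in for a hardware measurement."""
    return overhead_cycles + iterations * body_cycles


OVERHEAD = 150  # assumed constant measurement overhead, in cycles

# Reference loop (7 cycles per iteration) and a variant with one extra stall cycle.
base = cycles_per_iteration(simulated_total(10000, 7, OVERHEAD),
                            simulated_total(20000, 7, OVERHEAD), 10000, 20000)
variant = cycles_per_iteration(simulated_total(10000, 8, OVERHEAD),
                               simulated_total(20000, 8, OVERHEAD), 10000, 20000)
extra_cycles = variant - base  # cost of the sequence that differs between variants
```

Comparing the per-iteration cost of two variants in this way isolates the timing of the instruction sequence by which they differ.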

Running a large set of special-purpose programs, we were able to refine the model of the data cache
and obtain a rather precise formal model (see Fig. 13).



10.4 Test program example 

This methodology allowed us to work out an undocumented behavior of the data cache. The loop in
Listing 1.4 is executed 10000 times and 20000 times and the difference is 70000 cycles. This result is consistent
with the timing of the instructions found in [2] since the instructions in the loop take 7 cycles to execute
(the execution time of each instruction is given as a comment in Listing 1.4).



    .global ld_follow_st
ld_follow_st:
    ldr r2, [r1, #0]         @ preload both addresses
    ldr r2, [r1, #16]        @ in the data cache
ld_follow_st_loop:
    str r2, [r1, #0]         @ 1 cycle
    ldr r2, [r1, #16]        @ 1 cycle
    sub r0, r0, #1           @ 1 cycle
    cmp r0, #0               @ 1 cycle
    bgt ld_follow_st_loop    @ 3 cycles
    bx lr

Listing 1.4. Data cache timing behavior test



However, when the argument passed in r1 (the base address used to do the store and the load) is offset
by 16 bytes, the execution time is 80000 cycles because the instructions in the loop take 1 extra cycle to
execute.
The data cache has 64 sets and 32 bytes per line. So, the index is located in bits 10 to 5 of the address.
In the first case, with [r1] = 0x8004d94 and [r1] + 16 = 0x8004da4, the indexes are different. In the
second case, with [r1] = 0x8004da4 and [r1] + 16 = 0x8004db4, the indexes are equal. So, after a store
in a set, an access to the same set incurs a 1 cycle stall.
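The index computation behind this observation can be checked with a few lines. This is a minimal sketch assuming the geometry stated above (64 sets, 32-byte lines) and reading the addresses in the text as 0x8004d94, 0x8004da4 and 0x8004db4; the function names are hypothetical.

```python
LINE_SIZE_BYTES = 32   # bytes per cache line -> 5 offset bits (bits 4 to 0)
NUM_SETS = 64          # number of sets       -> 6 index bits (bits 10 to 5)


def cache_set_index(address: int) -> int:
    """Set index of an address: bits 10 to 5 for this cache geometry."""
    offset_bits = LINE_SIZE_BYTES.bit_length() - 1  # log2(32) = 5
    return (address >> offset_bits) % NUM_SETS


def same_set(store_address: int, load_address: int) -> bool:
    """True when the store and the following load hit the same cache set."""
    return cache_set_index(store_address) == cache_set_index(load_address)


first_case = same_set(0x8004d94, 0x8004d94 + 16)   # store at r1, load at r1 + 16
second_case = same_set(0x8004da4, 0x8004da4 + 16)  # base address offset by 16 bytes
```

In the first case the two accesses fall in sets 44 and 45, in the second case both fall in set 45, which matches the extra cycle observed per iteration.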



10.5 Experiments on Benchmark Programs 

The results we have obtained on some benchmark programs (see footnote 22) from Mälardalen University [19] are reported
in Table 1. The programs we have analysed are available from http://www.irccyn.fr/franck/wcet:
we have archived the C source program, the (disassembled) encapsulated binary program (.arm
file), the UPPAAL model (and property) and the dot graph. We have not given the time it takes to do the
slicing because it is less than a second. Regarding the benchmarks themselves, we point out that:

- the difficulty of measuring the WCET is not related to the size of the program; some programs are huge
but contain few paths, others are very compact but have a huge number of paths.

- they are designed to be representative of the difficulties encountered when computing WCET: for
instance janne-complex contains two loops and the number of iterations of the inner loop depends on
the current value of the counter of the outer loop (in a non-regular way).

- we have experimented on different compiled versions of the same program (options O0, O1, O2)
because the binary code produced stresses different parts of the hardware.

- we have checked various cases of the same programs with different initial stack pointer alignment, ...

- we have multiplied the number of iterations of the benchmarks (e.g., we compute the execution time
of Fib(300), see footnote 23); this way a modelling error (e.g., one that adds 1 cycle per iteration) is
revealed and will incur a huge over-approximation.

In this sense the programs we have experimented on should not be considered too easy.
The results in Table 1 are divided into three main sections:

- Single-Path programs. The results of this section show that the abstract models (program and hardware)
we have designed are adequate for obtaining tight bounds for the WCET. Even for janne-complex and
its intriguing inner loop counts that depend on the outer loop counter, the maximum error is 3.2%.
This also validates the accuracy of the program model we have computed (using slicing, without loop
unrolling or maximum loop bounds).

- Single-Path programs with data dependent instruction durations. Instructions like MUL/MLA can take
between 3 and 6 cycles in the E stage (and SMULL 4 to 7). This section highlights one of the advantages
of the timed automata models of the hardware. Indeed, in the timed automaton of the E stage (Fig. 12),
we can replace the guard t==DURATION with MINDUR <= t <= MAXDUR (and add the assignments
to MINDUR and MAXDUR). With this new E stage, we compute an interval for the WCET. Notice
that this model is robust against timing anomalies because we explore the state space without any
assumption like "always the shortest duration" or "always the largest duration"; the duration of the
instruction is picked non-deterministically in [MINDUR,MAXDUR] every time the transition is taken.
This explains the difference between the computed and the measured WCETs, because in the measured
WCET the worst-case duration for the MUL/MLA/SMULL instructions is never encountered. In this
case, column (C-M)/M x 100 of Table 1 does not represent the over-approximation of the computed WCET but
rather the under-approximation of the measured WCET with the chosen input data.

- Multiple-path programs. These programs contain some branches that depend on input data. The
measured WCET is the execution time (on the hardware) obtained with input data that are supposed (see footnote 24)
to produce the WCET. The computed WCET considers all the possible input data. For bs-O0, bs-O1 and
bs-O2 the WCET is very small and measurement errors are more than 1% (see Section 10.2).
Program cnt starts with the initialization of a 10 x 10 matrix. In cnt-O2, the compiler unrolls the
initialization loop into a list of 100 consecutive store instructions. So cnt-O2 stresses the write buffer and
we have to take into account the fact that the write buffer may be full. In this case, the data cache has
to wait until the write buffer is no longer full before making a write.

22 http://www.mrtc.mdh.se/projects/wcet/benchmarks.html

23 Even if we cannot compute Fib(300) we can compute the time it takes to compute it.

24 Note that the benchmark programs usually indicate which data should give the WCET but in some cases this is
erroneous.

Compared to existing methods and results our method has several advantages: 

- computation of the CFG and of the reduced program automaton is fully automated (no loop bounds
annotation needed);

- we use concrete caches and detailed models of the hardware;

- the model of the hardware can be tuned easily (e.g., durations of instructions can be an interval instead
of a fixed value); as emphasised in [10], changes in the processor speed can also be modelled easily
(using a timed automaton that sets the processor speed). This enables us to compute WCET with
power-related constraints. Another advantage is that changing the processor (e.g., to an ARM7) only
requires changing the pipeline automata;

- we compare the computed results to actual execution times using a rigorous protocol. The relative error
in the computed results can be assessed and the results show that our method and models give very
tight bounds.
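The relative error mentioned in the last item is the quantity (C - M)/M x 100 reported in Table 1. The following sketch shows one way to recompute it; the helper names are hypothetical, and for interval results it assumes, as stated in the notes of Table 1, that the upper bounds are compared.

```python
from typing import Sequence, Union

Bound = Union[int, Sequence[int]]


def upper(value: Bound) -> int:
    """Upper bound of a WCET given either as a number or as [min, max]."""
    return value if isinstance(value, int) else max(value)


def over_approximation_percent(computed: Bound, measured: Bound) -> float:
    """Relative error (C - M) / M x 100, using upper bounds for intervals."""
    c, m = upper(computed), upper(measured)
    return 100.0 * (c - m) / m


# Values taken from Table 1.
fib_o0 = over_approximation_percent(8098, 8064)
matmult_o0 = over_approximation_percent([502850, 529250], [511584, 528684])
```

These two calls give approximately 0.42 and 0.11 percent, in line with the 0.42% and 0.1% entries of Table 1 for fib-O0 and matmult-O0.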



11 Conclusion and Future Work 

In this paper we have presented a framework based on program slicing and model-checking to compute
WCET for programs running on architectures featuring pipelining and caching. We have exemplified the
method by providing formal models of the ARM920T. Moreover we have compared the computed results
with actual execution times on the real hardware. Our method is modular: the model of the hardware can
easily be altered by changing the timed automata models, and the CFG is computed automatically.

In some cases there is a huge number of paths to be explored and there is no hope that an exhaustive
search will compute any result in a lifetime. Examples of such programs are multiple-path programs
(e.g., program binary sort) with a lot of input data dependent branchings. To overcome this problem we are
developing branch and bound techniques. We are also currently extending the framework to handle:

- generation of traces: UPPAAL can generate a witness symbolic trace of a path yielding the WCET. 
From this symbolic trace, we want to compute initial values of the input data that produce this trace. 
This can be achieved using techniques similar to Counter Example Guided Abstraction Refinement 
(CEGAR) ED. 

- co-processor calls. This can be achieved by adding a timed automaton model of the co-processor. 

- interrupts. For some programs like OS kernels, interrupts can be generated and trigger interrupt handlers.
Computing the WCET in this case is not easy as it requires a model of the interrupt arrivals, e.g., "the interval
between two interrupts of type i is at least t time units". We can model interrupt arrivals using timed
automata.

Acknowledgements. The authors wish to thank Tim Bourke for the careful proof-reading of the paper and 
many helpful comments. 

References 

1. Armadeus systems. 

2. ARM9TDMI Technical Reference Manual. ARM Limited, 2000.

3. Application Binary Interface for the ARM Architecture. ARM Limited, 2009.

4. AbsInt Angewandte Informatik. aiT Worst-Case Execution Time Analyzers. http://www.absint.com/ait/

5. R. Alur and D. Dill. A theory of timed automata. Theoretical Computer Science, 126(2):183-235, 1994.

6. ARM Limited. Application Note 93 - Benchmarking with ARMulator. http://infocenter.arm.com/
help/topic/com.arm.doc.dai0093a/DAI0093A_benchmarking_appsnote.pdf






Program            loc†  UPPAAL             Computed         Measured         (C-M)/M  Abs§
                         Time/States‡       WCET (C)         WCET (M)         x 100

Single-Path Programs
fib-O0             74    1.74s/74181        8098             8064             0.42%    47/131
fib-O1             74    0.61s/22332        2597             2544             2.0%     18/72
fib-O2             74    0.3s/9710          1209             1164             3.8%     22/71
janne-complex-O0*  65    1.15s/38014        4264             4164             2.4%     78/173
janne-complex-O1*  65    0.48s/14600        1715             1680             2.0%     30/89
janne-complex-O2*  65    0.46s/13004        1557             1536             1.3%     32/78
fdct-O1            238   1.67s/60418        4245             4092             3.7%     100/363
fdct-O2            238   3.24s/55285        19231            18984            1.3%     166/3543

Single-Path Programs¶ with MUL/MLA/SMULL instructions (instruction durations depend on data)
fdct-O0            238   2.41s/85007        [11242,11800]    11448            3.0%     253/831
matmult-O0*        162   5m9s/10531230      [502850,529250]  [511584,528684]  0.1%     158/314
matmult-O1*        162   1m32s/1122527      [130001,156402]  [127356,153000]  2.2%     71/172
matmult-O2*        162   43.78s/1780548     [122046,148299]  [116844,140664]  5.4%     75/288
jfdcint-O0         374   2.79s/100784       [12699,12699]    12588            0.8%     159/792
jfdcint-O1         374   1.02s/35518        [4897,4899]      4668             7.0%     25/325
jfdcint-O2         374   5.38s/175661       [16746,16938]    16380            3.4%     56/2512

Multiple-Path Programs
bs-O0              174   42.6s/1421474      1068             1056             1.1%     75/151
bs-O1              174   28s/1214673        738              720              2.5%     28/82
bs-O2              174   15s/655870         628              600              4.6%     28/65
cnt-O0*            115   2.3s/76238         9028             8836             2.1%     99/235
cnt-O1*            115   1s/27279           4123             3996             3.1%     42/129
cnt-O2*            115   0.5s/11540         3065             2928             4.6%     39/263
insertsort-O0*     91    10m35s/24250737    3133             3108             0.8%     79/175
insertsort-O1*     91    7m2s/11455293      1533             1500             2.2%     40/115
insertsort-O2*     91    11.5s/387292       1371             1344             2.0%     43/108
ns-O0*             497   83.4s/3064315      30968            30732            0.8%     132/215
ns-O1*             497   11.3s/368719       11701            11568            1.1%     61/124
ns-O2*             497   29s/1030746        7343             7236             1.4%     566/863

† Lines of code in the C source file
‡ Time in min/seconds on Intel Dual Core i3 3.2 GHz, 8 GB RAM
§ Non-abstracted instructions/Instructions
¶ (C-M)/M x 100 computed using the upper bound for C (see Section 10.5)
* Program selected for the WCET Challenge 2006

Table 1. Results; file-Ox indicates that file was compiled using gcc -Ox (optimization option).